ECLeKTic: a Novel Challenge Set for
Evaluation of Cross-Lingual Knowledge Transfer


Omer Goldman∗γβ, Uri Shaham∗γ, Dan Malkinγ,

Sivan Eigerγ, Avinatan Hassidimγ, Yossi Matiasγ, Joshua Maynezδ,
Adi Mayrav Giladyγ, Jason Riesaδ, Shruti Rijhwaniδ, Laura Rimellδ,
Idan Szpektorγ, Reut Tsarfatyγ, Matan Eyalγ

βBar-Ilan University γGoogle Research, δGoogle DeepMind
{ogoldman, urishaham, matane}@google.com
Abstract

To achieve equitable performance across languages, multilingual large language models (LLMs) must be able to abstract knowledge beyond the language in which it was acquired. However, the current literature lacks reliable ways to measure LLMs’ capability of cross-lingual knowledge transfer. To that end, we present ECLeKTic (available at https://www.kaggle.com/datasets/googleai/eclektic), a multilingual closed-book QA (CBQA) dataset that Evaluates Cross-Lingual Knowledge Transfer in a simple, black-box manner. We detected information with uneven coverage across languages by controlling for the presence and absence of Wikipedia articles in 12 languages. We generated knowledge-seeking questions in a source language, for which the answer appears in a relevant Wikipedia article, and translated them to all other 11 languages, for which the respective Wikipedias lack equivalent articles. Assuming that Wikipedia reflects the prominent knowledge in an LLM’s training data, solving ECLeKTic’s CBQA task requires the model to transfer knowledge between languages. Experimenting with 8 LLMs, we show that SOTA models struggle to effectively share knowledge across languages even when they can predict the answer well for queries in the same language the knowledge was acquired in.

∗Equal contribution
Figure 1: A model incapable of cross-lingual knowledge transfer (top box) can only answer factual questions in their source language, that is, the language in which the information appeared in its training data. It cannot answer the same question when translated into another target language. A transfer-capable model (bottom box) is able to answer questions no matter the language. ECLeKTic allows distinguishing between the two by targeting facts that are unevenly distributed in the model’s training data, as approximated, in the figure and in the paper, by Wikipedia.

1 Introduction

Ideally, multilingual large language models (LLMs; Gemini Team, 2024; Llama Team, 2024; OpenAI, 2024, inter alia) should perform similarly and consistently in all languages they were trained on and, specifically, be equally knowledgeable across these languages. In addition to making models more human-like, this would enable speakers of languages with smaller resource footprints on the Web to have equitable access to the world’s knowledge, and speakers of higher-resource languages to access a wider range of information. Alas, LLMs are known to be inconsistent across languages (e.g., Ohmer et al., 2023), and in particular, their performance on factual queries varies significantly depending on the language in which the model is queried Jiang et al. (2020); Kassner et al. (2021); Qi et al. (2023).

LLMs may learn similar knowledge in different languages when such documents are encountered during pretraining. But, since observing all of the world’s knowledge in all languages, e.g., via translation, is not feasible, a complementary requirement from LLMs may be to share acquired knowledge between languages, irrespective of the language it was acquired in. Models with such capability would either represent knowledge in a language-agnostic way, or implicitly translate knowledge at inference time from the language in which it was acquired to the target language. However, measuring this knowledge-sharing capability is not trivial. Methods suggested in the literature require either causal interventions that are far from perfect and limited in their transparency Ifergan et al. (2024); Wei et al. (2024) or careful dissection of the model’s inner states that may be noisy Chen et al. (2024); Zhao et al. (2024a).

In this paper, we address the problem of empirically quantifying the cross-lingual knowledge-sharing ability of LLMs. Aiming for a simple, black-box evaluation, we introduce ECLeKTic to Evaluate Cross-Lingual Knowledge Transfer, by means of closed book QA. Consider, for example, a question asking who usually dubs characters played by Brad Pitt in German movies. Since in the German part of the internet it is well-known that the answer is Tobias Meister, a modern LLM is expected to answer this question easily when asked in German. However, to answer that in other languages, in which the internet contains little evidence of this fact, LLMs may struggle without being able to internally retrieve the German knowledge (see Figure 1).

To target facts like the one above, we constructed ECLeKTic from articles in the Wikipedias of 12 languages that have no equivalent articles in the other languages, such as the article dedicated to Tobias Meister that exists only on the German Wikipedia. We generated fact-seeking question/answer pairs based on those articles using Gemini Gemini Team (2024) and translated them to the other tested languages. The entire generation and translation phases were manually verified by human annotators. The result is a set of questions and answers that relate to a fact that is well known only in one language (henceforth: the source language), but are contained in the dataset in all 12 languages (the target languages). Each question-answer pair is accompanied by the relevant Wikipedia context. Various LLMs were then asked these questions across all target languages in a closed-book setting and their predictions were judged by another model Chiang and Lee (2023); Zheng et al. (2023) in an open-book setting.

Based on the predictions for each question in the 12 languages, we defined 2 metrics: overall success, which reflects the model’s ability to solve ECLeKTic as a whole by transferring knowledge across languages, and transfer score, which only measures the model’s ability to transfer correct answers. We experimented with 8 top-performing models, both open-source and proprietary, to demonstrate that, across the board, ECLeKTic poses a significant challenge. The best performing model, Gemini 2.0 Pro, achieves an overall success of 41.3% and manages to transfer only 65.0% of the facts it was able to retrieve in the respective source language. Breaking down the results by source and target language, we show that a shared script is a major factor in the ease of transfer, corroborating findings from previous works Qi et al. (2023); Ifergan et al. (2024). Finally, we tested models of various sizes, all from the Qwen 2.5 model series Qwen Team (2025), and found that bigger models are not able to transfer more knowledge in relative terms, i.e., in terms of transfer score, although they are more successful in terms of overall success.

All in all, the contribution of this paper is twofold. First, we introduce ECLeKTic, a novel benchmark for cross-lingual transfer evaluation, along with its construction and evaluation process. Second, we present a systematic evaluation of state-of-the-art models on ECLeKTic, showing lack of knowledge transfer across languages, leaving significant headroom for further research towards more capable and consistent multilingual models.

2 The Challenge of Cross-lingual Transfer Evaluation

Figure 2: A schematic overview of the creation of ECLeKTic and its application in evaluating language models.

With the rapid evolution of open-source and proprietary LLMs, we need robust black-box methods for evaluating and scrutinizing various model capabilities. In this work, we target the cross-lingual knowledge transfer abilities of multilingual LLMs. Specifically, we wish to assess whether consistent outputs for the same input in different languages occur due to genuine knowledge sharing — stemming from internal orthogonal treatment of knowledge and language — or due to incidental exposure and memorization of the same information in multiple languages during training.

To the best of our knowledge, prior work for directly assessing parametric knowledge-sharing and cross-lingual knowledge-transfer capabilities mainly looked at the internal mechanisms of open-source models. One line of work utilizes knowledge editing to determine whether the model’s representation of knowledge is causally linked across languages Ifergan et al. (2024); Wei et al. (2024). However, editing methods are far from perfect as the field is still evolving. Another approach uses direct observation of neuron activations in order to understand the extent to which actual transfer occurs Chen et al. (2024); Zhao et al. (2024a). One downside of this approach is that models’ inner states are not easily interpretable. Yet, more limiting is the fact that all the above methods are white-box inspection approaches that can only be applied to open-source models. These methods cannot assess modern SOTA proprietary models, for which only black-box analysis is possible.

In this work, we present ECLeKTic, a black-box evaluation of cross-lingual knowledge transfer. It relies on the well-established method of closed-book QA to query parametric knowledge AlKhamissi et al. (2022), with questions carefully selected so that correct answers indicate genuine knowledge transfer.

3 ECLeKTic

In this section we introduce the ECLeKTic benchmark. We detail the construction of the dataset and then describe the evaluation procedure and metrics.

3.1 The ECLeKTic Dataset Construction

As ECLeKTic is a QA benchmark, we want to include only questions whose answers were exposed to the model in a single language during pre-training. Then, when we query such questions in other languages, an LLM could answer them from its parametric memory if it has a representation of that knowledge that is language agnostic.

To generate such question/answer pairs, we selected articles in Wikipedia that exist only in one of the 12 target languages (see list in Table 1). Concretely, we analyzed the July 2023 Wikipedia dump and for each language sampled 100 articles that contain at least 200 characters, had at least 100 views during 2023 (statistics available at https://stats.wikimedia.org/#/all-wikipedia-projects), and, most importantly, do not have equivalent articles in any of the other 11 languages. From each such article we extracted the first 10 sentences, and based on them we instructed Gemini to generate a question and an answer.
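For illustration, the sketch below mirrors this selection step under the assumption of a pre-parsed dump in which each article record carries its text, 2023 view count, and interlanguage links; the data layout, field names, and helper names are hypothetical, not part of the released pipeline.

```python
# Illustrative sketch of the article selection criteria described above.
# `dump[lang]` is assumed to be a list of article records; all field names
# are hypothetical.
import random

LANGS = ["en", "fr", "de", "he", "hi", "id", "it", "ja", "ko", "zh", "pt", "es"]

def select_candidate_articles(dump, lang, k=100, seed=0):
    """Sample k articles that exist only in `lang` among the 12 ECLeKTic languages."""
    others = set(LANGS) - {lang}
    candidates = [
        a for a in dump[lang]
        if len(a["text"]) >= 200                          # at least 200 characters
        and a["views_2023"] >= 100                        # at least 100 views during 2023
        and not (set(a["interlanguage_links"]) & others)  # no equivalent article elsewhere
    ]
    random.seed(seed)
    return random.sample(candidates, k)
```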

The context, the question, and the answer of all candidate generations were validated by human annotators. First, the annotators checked that the question is answerable in a closed-book setting, i.e., it does not refer explicitly to the context or mention the answer. Second, they validated that the question is related to a fact that is particularly relevant to the language in question, e.g., it does not relate to a science or other general knowledge fact. Questions and answers that did not meet these criteria were discarded. Third, the annotators made sure that the question contains all the information needed to be answerable when translated. For example, a question in Hebrew relating to the TV series "Survivor" was disambiguated by the annotators to explicitly mention "the Israeli adaptation of Survivor". Named entities were also clarified similarly, so a question referring to "Ambev" was modified to refer to "the Brazilian brewing company, Ambev".

Finally, each retained question, answer and context were automatically translated to the other 11 languages. The translations were verified by another set of human annotators and modified when needed. At this stage some examples were also discarded if they proved untranslatable, for example, when a question explicitly refers to the meaning of a word in the source language. To overcome the difficulties in the verification of translation between non-trivial language pairs, this stage was done through English as a pivot language.

The complete annotation process is depicted in Figure 2. All prompts that were used in the data creation are detailed in Appendix A. The statistics of the final benchmark are depicted in Table 1.

Source Language # Examples
English 39
French 22
German 16
Hebrew 29
Hindi 64
Indonesian 31
Italian 29
Japanese 26
Korean 33
Mandarin Chinese 35
Portuguese 28
Spanish 32
total generated 384
total translated 4224
total evaluated 4608
Table 1: Statistical information on the examples in ECLeKTic, broken down by source language.

3.2 ECLeKTic Metrics

To empirically assess the cross-lingual knowledge-transfer capabilities over ECLeKTic, we devise two metrics that accompany the benchmark. The first is overall success, which measures the extent to which a model succeeds in correctly answering the questions in ECLeKTic in both the source and the target language. The second is transfer score, which measures the success of the knowledge transfer itself, taking into account only the questions answered correctly in their source language.

Formally, for each question/answer pair we define example-level success, based on the target language in which they are written, $l_t$, and on the source language from which they were translated, $l_s$, as

$$S^{q,a}_{M} = \mathds{1}\big( M(q_{l_t}) = a_{l_t} \;\land\; M(q_{l_s}) = a_{l_s} \big)$$

where $M(q_l)$ is the prediction of model $M$ for question $q$ in language $l$. $S^{q,a}_{M}$ is a 0/1 score that indicates whether questions that are expected to be correctly answered in language $l_s$ are indeed answered correctly both in this language as well as in another target language $l_t$, capturing positive knowledge sharing between the two languages. Given a set of examples $D$, the final overall metric simply averages across all question/answer pairs.

$$S^{\text{overall}}_{M} = \frac{\sum_{(q,a) \in D} S^{q,a}_{M}}{|D|}$$

$S^{\text{overall}}$ is the main metric of ECLeKTic.

To strictly measure the knowledge transfer probability, we consider only question/answer pairs for which the model provided the correct answer in the source language: $K = \{(q,a) \mid M(q_{l_s}) = a_{l_s}\}$. We then define the transfer score as the fraction of questions answered correctly by the model in $l_t$ among those answered correctly in $l_s$:

$$S^{\text{transfer}}_{M} = \frac{\sum_{(q,a) \in K} S^{q,a}_{M}}{|K|}$$

Note that this metric does not explicitly reflect the number of questions answered correctly in their source language. As a result, it can be maximized even by weak models, for which $|K|$ is small, as long as they can answer the same questions in all languages.
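The following minimal sketch computes both metrics from per-example judge verdicts; the field names and data layout are illustrative and do not reflect the released format.

```python
def eclektic_scores(examples):
    """Compute overall success and transfer score.

    `examples` is a list of dicts with boolean fields `correct_in_source`
    and `correct_in_target`, one per (question, target language) pair.
    """
    both = [ex["correct_in_source"] and ex["correct_in_target"] for ex in examples]
    in_source = [ex["correct_in_source"] for ex in examples]

    overall = sum(both) / len(both)                # S_overall, averaged over D
    k = [b for b, s in zip(both, in_source) if s]  # restrict to K: correct in source
    transfer = sum(k) / len(k) if k else 0.0       # S_transfer, averaged over K
    return overall, transfer
```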

To determine whether a model gives the correct answer to a specific question in a specific language, i.e., whether $M(q_l) = a_l$, we use an LLM as a judge Zheng et al. (2023). This is done in order to avoid the pitfalls of automatic metrics that are less correlated with human judgements Chen et al. (2019). The judge model, in our case Gemini 2.0 Flash, receives as input the question and the prediction of the tested model, as well as the translated context, in order to verify that the predicted answer is correct Zhou et al. (2025). The prompt used for judgement is in Appendix A.
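As a sketch, assuming a generic `call_llm` wrapper around the judge model (a placeholder, not a real API), the verdict for a single prediction can be obtained roughly as follows, using an abridged version of the judging prompt from Appendix A.2.

```python
# Abridged judging prompt, following Appendix A.2; the `call_llm` callable is
# an assumed placeholder for whatever client wraps the judge model.
JUDGE_PROMPT = """**Task:** Determine if an answer to the question is supported by a given text.

**Single Word Output (in English):** - YES: Answer is derived from the text. - NO: Answer is not derived from the text.

Text: {context}

Question: {question}

Answer: {predicted_answer}

Output:"""

def is_correct(call_llm, context, question, predicted_answer):
    """Return True if the judge deems the prediction supported by the context."""
    prompt = JUDGE_PROMPT.format(
        context=context, question=question, predicted_answer=predicted_answer
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```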

3.3 Assumptions in ECLeKTic

There are several important assumptions we made in the construction of ECLeKTic, which we make explicit here.

First, we assume that all the tested models were exposed to the articles we extracted from Wikipedia. Moreover, we assume that the model was exposed to the information in those articles in their respective source languages multiple times during its training, and that the knowledge is therefore more accessible to it. In practice, Wikipedia itself is repeated in the pre-training data of most LLMs due to the quality of its texts Brown et al. (2020); Chowdhery et al. (2023), and we assume that this holds also for all multilingual models. Additionally, we assume that the existence of a Wikipedia article reflects a general interest of online speakers in the topic, so the information is also likely to be repeated in the same language outside of Wikipedia. This is even more plausible given that we only targeted articles with a significant yearly view count.

Conversely, we assume that the absence of an article from a certain Wikipedia reflects the lack of interest of online speakers in that topic. The information on that topic is therefore assumed to appear sparsely on the internet, if at all, and be far less accessible to the model in that language.

Together, these assumptions allow us to treat the (in)existence of a fact in Wikipedia as a proxy for the (lack of) exposure to that fact in a specific language for a multilingual model trained on the entire internet.

4 Experiments

4.1 LLMs Struggle with Knowledge Transfer

Model Overall Transfer
Gemini 2.0 Pro 41.6 ± 1.5 65.0 ± 1.8
GPT 4o 38.8 ± 1.4 67.0 ± 1.8
Gemini 2.0 Flash 34.6 ± 1.4 62.3 ± 1.9
Claude 3.5 Sonnet 34.4 ± 1.4 60.8 ± 1.9
Gemma 2 9B 8.7 ± 0.8 40.3 ± 3.1
Mistral Nemo 7.1 ± 0.8 38.9 ± 3.4
Qwen 2.5 7B 2.8 ± 0.5 23.5 ± 3.7
Olmo 2 7B 1.6 ± 0.3 17.2 ± 3.7
Table 2: Performance for all proprietary and open models over all examples in ECLeKTic in both metrics.

To demonstrate ECLeKTic’s value in cross-lingual transfer evaluation, we measured the performance of several models. We included open models, namely the latest instruction-tuned versions of Gemma (https://huggingface.co/google/gemma-2-9b-it), Mistral (https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407), Qwen (https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), and Olmo (https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct), all in sizes of between 7 and 9 billion parameters. In addition, the black-box nature of ECLeKTic allows the evaluation of closed models as well. We therefore also included GPT 4o (version gpt-4o-2024-11-20), Claude 3.5 Sonnet (version claude-3-5-sonnet-20241022), and Gemini 2.0 in both the Pro (currently an experimental release) and Flash (version gemini-2.0-flash-001) versions.

All models were evaluated in a zero-shot setting and were prompted with the question on its own, i.e., without an explicit instruction. The results can be found in Table 2. They show a clear gap in performance between the two groups, with proprietary models outperforming the open ones by a very wide margin, probably due to their bigger size in terms of parameters (see also Section 4.3).

Gemini 2.0 Pro is the best performing model in solving the task of ECLeKTic, as defined in terms of overall success. It manages to answer correctly, in both source and target language, 41.3% of the examples. This score reveals that there is still clear room for improvement in the knowledge retrieval and transfer capabilities of all models. In terms of the transfer score, that is, the portion of questions answered correctly in the target language out of those answered correctly in their source language, the performance is much better, with all closed models achieving more than 60%, but it is still far from perfect.

All in all, we conclude that ECLeKTic presents a serious challenge to modern LLMs despite their impressive abilities overall.

4.2 Shared Script Eases Transfer

Figure 3: Transfer score results of Gemini 2.0 Pro broken down per source and target language. Blue is better, red is worse. Note the diagonal is perfect by the definition of the transfer metric.

In order to provide further insight into the factors that affect knowledge transfer, Figure 3 details a per-language breakdown of the transfer scores of our best performing model, Gemini 2.0 Pro. This analysis shows that the model’s ability to transfer knowledge is highly dependent on the source and target language, with average scores ranging from 23.5 (transfer from Portuguese to Japanese) to 100.0 (from German to Indonesian).

It is also possible to see that transfer is much higher between languages with the same script. This is evident from the high transfer score between German, English, French, Spanish, Italian, Portuguese, and Indonesian. Note that the latter is not genealogically related to the rest of the Latin-written languages, yet it performs on par with the others when serving as a source or a target to other Latin-written languages. The dependence on script can also be seen in the transfer scores between Chinese and Japanese, especially when compared to the transfer from Japanese to other languages. This finding aligns with previous works in the literature on the importance of script to transfer Malkin et al. (2022); Mittal et al. (2023); Ifergan et al. (2024).

Additionally, this analysis reveals an asymmetry in transfer scores depending on the role of the language, as a source or a target. For example, knowledge seems to be easily transferable from Hindi, mostly to Latin-written languages and to Chinese, resulting in a macro-averaged transfer score of 78.6. But the transfer to Hindi is much worse, averaging only 59.6 when Hindi is the target language.

A similar breakdown for the overall success metric can be found in Appendix B.

4.3 Bigger Isn’t Necessarily Better

Figure 4: Performance of Qwen 2.5 models in various sizes in terms of both overall success and transfer score. While transfer saturates at around 14B parameters, overall success keeps improving with the increasing model size.
Gemini 2.0 Flash Overall Transfer
Closed-book 34.5 ± 1.4 62.3 ± 1.9
General hint 35.3 ± 1.4 64.4 ± 1.9
Source language name 41.4 ± 1.5 70.0 ± 1.8
Source language title 47.4 ± 1.4 75.8 ± 1.6
Cross-lingual open-book 94.3 ± 0.7 96.0 ± 0.6
Table 3: Performance of Gemini 2.0 Flash when prompted with hints adding increasing amounts of information, from a general hint to use knowledge in another language to a cross-lingual open-book QA.

To further explore the role of model size in the ability to transfer knowledge, we evaluated all 7 model sizes of the Qwen 2.5 series: 0.5, 1.5, 3, 7, 14, 32, and 72 billion parameters. We plotted the transfer and overall scores of these models in Figure 4.

The difference between the curves of the transfer and the overall scores is clearly evident. The performance in terms of overall score continues to improve with almost every increase in model size, and it seems plausible that an even bigger version of Qwen would perform even better. On the other hand, in terms of transfer score, the performance improves rapidly with the increase in model size up until 14B parameters and then somewhat saturates, as more than a 5-fold increase in model size gives an improvement of only about 7 percentage points in transfer. Taken together, this leads to the conclusion that the improvement in the overall scores of the bigger models comes mostly from their ability to retrieve more facts in their source language, while the proportion of facts transferred rises only marginally.

4.4 Hinting LLMs Into Success Is Not Easy

In the evaluation setting we examined until now — a zero-shot setting where models are only given the question itself with no instruction — we found that models have significant room for improvement. In order to characterize the pitfalls that make models fail and how to avoid them, we conducted an ablation study, incorporating more and more information into the prompt given to the model. We experimented with Gemini 2.0 Flash in the following settings:

  • Closed-book. This is the setting used for the main experiment, giving the model only the question to be answered.

  • General hint. In this setting, the model is instructed to answer the question and to use its knowledge in other languages if it sees fit.

  • Source language name. This setting is very similar to the previous one, but the name of the language with the relevant knowledge is given explicitly as part of the prompt.

  • Source language title. Here the model is given not only the name of the language but also the title of the Wikipedia article from which the question was generated. The title is given in its source language.

  • Cross-lingual open-book. In this last setting, the prompt includes the context in its source language and the question in the target language, so the model can completely disregard its parametric knowledge; it is only required to bridge the language differences in its input. This experiment is equivalent to that done by Chua et al. (2024).

The prompts used in all of these settings are given in Appendix A.
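As an illustration, the sketch below assembles the prompts for these five settings, mirroring the templates in Appendix A.3; the function and argument names are purely illustrative and not part of the original setup.

```python
def build_prompt(setting, question, source_lang=None, title=None, context=None):
    """Assemble the prompt for one of the five ablation settings (Appendix A.3)."""
    if setting == "closed_book":
        return question
    if setting == "general_hint":
        return ("Answer the following question based on your knowledge "
                f"in another language.\n\n{question}")
    if setting == "source_language_name":
        return ("Answer the following question based on your knowledge "
                f"in {source_lang}.\n\n{question}")
    if setting == "source_language_title":
        return ("Answer the following question based on your knowledge "
                f"in {source_lang} about {title}.\n\n{question}")
    if setting == "cross_lingual_open_book":
        return f"Context: {context}\n\n{question}"
    raise ValueError(f"Unknown setting: {setting}")
```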

The results, given in Table 3, show that simply hinting the model towards cross-lingual knowledge is not enough, providing only an insignificant improvement. Revealing the source language and the topic leads to much improved performance, of about 7 and 12 percentage points respectively, indicating that in some cases the knowledge may be partially available to the model, but it requires some guidance.

However, the most substantial improvement, almost to the point of solving ECLeKTic, comes when we give the model the correct context, just in the source language. The model has no problem reasoning across languages to produce the correct answer in 94.3% of the examples. This means that when the knowledge is available in the prompt, transferring it is less of a problem. The limited performance over ECLeKTic is then more likely to arise from the difficulty in retrieving knowledge cross-lingually rather than in processing it.

5 ECLeKTic Popular Pages

Model Popular Pages
Overall Transfer
Gemini 2.0 Pro 36.7 ± 1.2 67.8 ± 1.7
GPT 4o 34.5 ± 1.2 67.3 ± 1.7
Gemini 2.0 Flash 31.6 ± 1.2 65.8 ± 1.8
Claude 3.5 Sonnet 23.8 ± 1.1 56.8 ± 2.0
Gemma 2 9B 6.0 ± 0.6 33.0 ± 2.9
Mistral Nemo 4.0 ± 0.5 31.5 ± 3.4
Qwen 2.5 7B 1.5 ± 0.3 18.9 ± 3.6
Olmo 2 7B 0.8 ± 0.2 9.3 ± 2.5
Table 4: Performance for all proprietary and open models over all examples in ECLeKTic popular pages in both metrics.

Beyond reliance on Wikipedia article distribution, when constructing ECLeKTic we also made a decision to base the questions and answers on pages that exist in only one language out of the 12 in our selection. While this provides control over the language from which the knowledge has to be transferred, it also means that ECLeKTic may include many questions on topics that are somewhat marginal and less consequential to users in the target languages.

To examine whether our results are due to the models’ limited exposure to the topics, we experimented with another variant of our dataset, namely ECLeKTic popular pages. In this version we sampled articles that were popular in terms of views, instead of prioritizing lack of equivalents. Thus, this version gives emphasis to topics that are more likely to interest average users and to appear more frequently outside Wikipedia.

Specifically, from each Wikipedia we took the 200 articles that had the most views during April 2023 and lacked an equivalent in at least one of the 12 Wikipedias. Then, through the same pipeline described in Section 3.1, we created, translated, and verified questions and answers based on the content of the articles. We ended up with 964 unique question/answer pairs. However, each example was evaluated on a subset of target languages consisting only of languages whose Wikipedias do not include an article on the question’s topic, resulting in a total of 6,628 examples summed over all languages (the data for this variant is also included in https://www.kaggle.com/datasets/googleai/eclektic).

We evaluated the same models of Section 4.1 on this data as well. The results, detailed in Table 4, are in line with the results over the main ECLeKTic dataset in Table 2. The order of models by performance did not change: Gemini 2.0 Pro is the best model, followed by GPT 4o, with the open 7-9B models performing far worse. This experiment indicates the robustness of our results to the article selection criterion and the validity of our assumptions about the link between the distribution of Wikipedia articles and the necessity of transfer.

6 Related Work and Discussions

6.1 Types of Cross-Lingual Transfer

The term cross-lingual transfer has been extensively used in the NLP literature. The idea of one language benefiting from resources in another through a shared representation goes back at least to McDonald et al. (2011). However, the term has been used to refer to different specific experimental settings, some of them significantly different from the setting of this work.

The first distinction in the literature concerns what is being transferred, specifically the difference between cross-lingual skill transfer and knowledge transfer (Rajaee and Monz, 2024). Cross-lingual skill transfer is the ability to generalize a given skill to unseen languages regardless of the language that was used to learn it. E.g. perfecting multilingual summarization when learning to summarize from English data, or excelling at multilingual instruction following while only exposed to a few languages (Hu et al., 2020; Turc et al., 2021; Malkin et al., 2022; Huang et al., 2023; Shaham et al., 2024). On the other hand, cross-lingual knowledge transfer is the ability to retrieve factual knowledge from one language’s data when queried with any language (Asai et al., 2020; Limkonchotiwat et al., 2022; Mittal et al., 2023; Chua et al., 2024; Litschko et al., 2024).

In addition, methods for evaluation of cross-lingual transfer can also be orthogonally categorized into two broad approaches: fine-tuning on sources, then testing on targets (e.g., Shaham et al., 2024; Limkonchotiwat et al., 2022) vs. zero-shot evaluation on the targets (e.g., Malkin et al., 2022; Chua et al., 2024). While the former enables higher control over the data and experimental setting, the latter is less expensive and reflects how models behave in the wild.

Lastly, when dealing with cross-lingual knowledge transfer, the knowledge can be either parametric (e.g., Rajaee and Monz, 2024) or contextual (e.g., Chua et al., 2024; Mondshine et al., 2025), a distinction explored by Neeman et al. (2022) in a monolingual setting.

Within this taxonomy of cross-lingual transfer related works, ECLeKTic clearly belongs to methods evaluating parametric knowledge transfer in a zero-shot setting. In addition, the experiments in Section 4.4 gradually transition from parametric to contextual knowledge transfer, with the cross-lingual open-book experiment occupying the other end of that spectrum.

6.2 ECLeKTic and Cross-Lingual Consistency

Cross-lingual knowledge transfer is closely related to cross-lingual consistency, which has been explored extensively in recent literature. Starting with monolingual English settings, early works noticed that LLMs lack a guarantee on consistency due to the statistical nature of their training. That is to say that models may generate contradictory statements when presented with semantically equivalent inputs Kassner and Schütze (2020); Ravichander et al. (2020) or may rephrase identical factual information in different ways Elazar et al. (2021). In the context of modern LLMs, there has been increasing attention on ensuring output consistency when faced with variations in prompts Mizrahi et al. (2024); Sclar et al. (2024); Zhao et al. (2024b). A variety of methods have been proposed to improve this consistency, including the augmentation and diversification of instructions during training Liang et al. (2023); Zhao et al. (2024b).

In multilingual models, the challenge becomes even more complex, as it introduces the issue of cross-lingual inconsistency, where models fail to provide consistent responses to semantically equivalent inputs in different languages. To assess cross-lingual consistency, earlier works translated English datasets, either derived from knowledge bases Kassner et al. (2021); Jiang et al. (2020); Ifergan et al. (2024) or from NLP tasks Ohmer et al. (2023), into multiple languages. These studies then measured the variance in LLM responses, typically in terms of answer overlap or ranking Qi et al. (2023).

However, when applying translation across-the-board to entire datasets, these consistency-focused benchmarks may include some well-known facts that the model saw and memorized in many languages separately and could therefore easily predict consistently (see discussion in Section 2). Moreover, starting with English-constructed datasets may introduce biases into these benchmarks.

The QA task of ECLeKTic, although created with transfer in mind, may also serve as a better benchmark for cross-lingual consistency. To begin with, ECLeKTic is not generated in English but in all languages of the dataset. But more importantly, by targeting specific knowledge that is less known in most languages, we present models with a far greater challenge in keeping their answers consistent. It is therefore possible to view the results over ECLeKTic also as a tighter upper bound on the consistency of multilingual models.

7 Conclusions

In this paper we presented ECLeKTic, a closed-book QA dataset for evaluating the abilities of models to transfer knowledge from their parametric memory across languages. This black-box evaluation was made possible by carefully phrasing questions that target topics that are highly visible in one language and not in any of the others. Our benchmark allows for simple and reliable transfer evaluation which is, for the first time, easily applicable also to API-fenced models. Our results show that cross-lingual knowledge transfer is a difficult task that is far from being solved, and that ECLeKTic can be employed to track the progress made towards consistent and inclusive multilingual language models.

8 Limitations

Time sensitivity

When constructing ECLeKTic we relied on the distribution of topics in Wikipedia in different languages as reflected in July 2023. The article distribution may of course change over time, mostly as new topics become more prominent for the speakers of a given language, on Wikipedia and in general. This makes ECLeKTic somewhat time-dependent, so it will probably require an update in the future.

Number of languages

Although ECLeKTic covers varied languages, its coverage is obviously partial. It covers only 6 out of the 10 most spoken languages according to Ethnologue (https://www.ethnologue.com/statistics/), and 8 out of the 10 most active Wikipedias in terms of active users (https://en.wikipedia.org/wiki/List_of_Wikipedias). However, since all examples are translated to all languages, adding significantly more languages would make evaluation with ECLeKTic slower and more expensive.

The limited number of languages also makes the conclusions of Section 4.2 less unequivocal, but the fact that previous works pointed to the same conclusions Malkin et al. (2022); Mittal et al. (2023); Ifergan et al. (2024) provides some reassurance.

References

  • AlKhamissi et al. (2022) Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona Diab, and Marjan Ghazvininejad. 2022. A review on language models as knowledge bases. arXiv preprint arXiv:2204.06031.
  • Asai et al. (2020) Akari Asai, Jungo Kasai, Jonathan H Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2020. Xor qa: Cross-lingual open-retrieval question answering. arXiv preprint arXiv:2010.11856.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Chen et al. (2019) Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Evaluating question answering evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 119–124, Hong Kong, China. Association for Computational Linguistics.
  • Chen et al. (2024) Yuheng Chen, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. 2024. Journey to the center of the knowledge neurons: Discoveries of language-independent knowledge neurons and degenerate knowledge neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17817–17825.
  • Chiang and Lee (2023) Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  • Chua et al. (2024) Lynn Chua, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, Chulin Xie, and Chiyuan Zhang. 2024. Crosslingual capabilities and knowledge barriers in multilingual large language models. In NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward.
  • Elazar et al. (2021) Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031.
  • Gemini Team (2024) Gemini Team. 2024. Gemini: A family of highly capable multimodal models. Preprint, arXiv:2312.11805.
  • Hu et al. (2020) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pages 4411–4421. PMLR.
  • Huang et al. (2023) Haoyang Huang, Tianyi Tang, Dongdong Zhang, Xin Zhao, Ting Song, Yan Xia, and Furu Wei. 2023. Not all languages are created equal in LLMs: Improving multilingual capability by cross-lingual-thought prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12365–12394, Singapore. Association for Computational Linguistics.
  • Ifergan et al. (2024) Maxim Ifergan, Leshem Choshen, Roee Aharoni, Idan Szpektor, and Omri Abend. 2024. Beneath the surface of consistency: Exploring cross-lingual knowledge representation sharing in llms. Preprint, arXiv:2408.10646.
  • Jiang et al. (2020) Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, and Graham Neubig. 2020. X-FACTR: Multilingual factual knowledge retrieval from pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5943–5959, Online. Association for Computational Linguistics.
  • Kassner et al. (2021) Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021. Multilingual LAMA: Investigating knowledge in multilingual pretrained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3250–3258, Online. Association for Computational Linguistics.
  • Kassner and Schütze (2020) Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7811–7818, Online. Association for Computational Linguistics.
  • Liang et al. (2023) Shihao Liang, Runchu Tian, Kunlun Zhu, Yujia Qin, Huadong Wang, Xin Cong, Zhiyuan Liu, Xiaojiang Liu, and Maosong Sun. 2023. Exploring format consistency for instruction tuning. Transactions on Machine Learning Research.
  • Limkonchotiwat et al. (2022) Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, and Sarana Nutanong. 2022. Cl-relkt: Cross-lingual language knowledge transfer for multilingual retrieval question answering. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2141–2155.
  • Litschko et al. (2024) Robert Litschko, Oliver Kraus, Verena Blaschke, and Barbara Plank. 2024. Cross-dialect information retrieval: Information access in low-resource and high-variance languages. arXiv preprint arXiv:2412.12806.
  • Llama Team (2024) Llama Team. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.
  • Malkin et al. (2022) Dan Malkin, Tomasz Limisiewicz, and Gabriel Stanovsky. 2022. A balanced data approach for evaluating cross-lingual transfer: Mapping the linguistic blood bank. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4903–4915, Seattle, United States. Association for Computational Linguistics.
  • McDonald et al. (2011) Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 62–72, Edinburgh, Scotland, UK. Association for Computational Linguistics.
  • Mittal et al. (2023) Shubham Mittal, Keshav Kolluru, Soumen Chakrabarti, et al. 2023. mokb6: A multilingual open knowledge base completion benchmark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–214.
  • Mizrahi et al. (2024) Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. 2024. State of what art? a call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics, 12:933–949.
  • Mondshine et al. (2025) Itai Mondshine, Tzuf Paz-Argaman, and Reut Tsarfaty. 2025. Beyond english: The impact of prompt translation strategies across languages and tasks in multilingual llms. In Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics.
  • Neeman et al. (2022) Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. 2022. Disentqa: Disentangling parametric and contextual knowledge with counterfactual question answering. arXiv preprint arXiv:2211.05655.
  • Ohmer et al. (2023) Xenia Ohmer, Elia Bruni, and Dieuwke Hupkes. 2023. Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 258–276, Singapore. Association for Computational Linguistics.
  • OpenAI (2024) OpenAI. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
  • Qi et al. (2023) Jirui Qi, Raquel Fernández, and Arianna Bisazza. 2023. Cross-lingual consistency of factual knowledge in multilingual language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10650–10666, Singapore. Association for Computational Linguistics.
  • Qwen Team (2025) Qwen Team. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
  • Rajaee and Monz (2024) Sara Rajaee and Christof Monz. 2024. Analyzing the evaluation of cross-lingual knowledge transfer in multilingual language models. arXiv preprint arXiv:2402.02099.
  • Ravichander et al. (2020) Abhilasha Ravichander, Eduard Hovy, Kaheer Suleman, Adam Trischler, and Jackie Chi Kit Cheung. 2020. On the systematicity of probing contextualized word representations: The case of hypernymy in BERT. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 88–102, Barcelona, Spain (Online). Association for Computational Linguistics.
  • Sclar et al. (2024) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.
  • Shaham et al. (2024) Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. 2024. Multilingual instruction tuning with just a pinch of multilinguality. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2304–2317, Bangkok, Thailand. Association for Computational Linguistics.
  • Turc et al. (2021) Iulia Turc, Kenton Lee, Jacob Eisenstein, Ming-Wei Chang, and Kristina Toutanova. 2021. Revisiting the primacy of english in zero-shot cross-lingual transfer. arXiv preprint arXiv:2106.16171.
  • Wei et al. (2024) Zihao Wei, Jingcheng Deng, Liang Pang, Hanxing Ding, Huawei Shen, and Xueqi Cheng. 2024. Mlake: Multilingual knowledge editing benchmark for large language models. ArXiv, abs/2404.04990.
  • Zhao et al. (2024a) Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. 2024a. How do large language models handle multilingualism? Preprint, arXiv:2402.18815.
  • Zhao et al. (2024b) Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Shuaiqiang Wang, Chong Meng, Zhicong Cheng, Zhaochun Ren, and Dawei Yin. 2024b. Improving the robustness of large language models via consistency alignment. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8931–8941, Torino, Italia. ELRA and ICCL.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623.
  • Zhou et al. (2025) Jin Peng Zhou, Sébastien M. R. Arnold, Nan Ding, Kilian Q. Weinberger, Nan Hua, and Fei Sha. 2025. Graders should cheat: privileged information enables expert-level automated evaluations. Preprint, arXiv:2502.10961.

Appendix A Prompts

A.1 ECLeKTic Creation

During the creation of ECLeKTic, LLMs were used for question and answer generation and for the translation of the examples from their respective source languages to all other target languages. Although human annotators verified and fixed their outputs, we also provide here the prompts that were used.

Question and Answer Generation

**Task:** Formulate a question in {lang_name} that requires a deep understanding of a given {lang_name} Wikipedia paragraph.

**Requirements:**

* **Context-Specific:** The question must be answerable solely through information presented within the paragraph, excluding general knowledge or common sense.

* **Self-Contained:** The question should be completely self-explanatory, providing all necessary context within its phrasing. Assume the reader has no access to the paragraph when answering the question.

* **Single Concrete Factual Detail:**

- The question should not require multiple answers or involve listing multiple details.

- Avoid asking about opinions, interpretations.

- If you can’t answer the question, prefer to generate another question.

- Focus on extracting a specific, concrete, factual detail that the paragraph directly states.

- Be specific:

- If you are asking about an entity be clear about it – Use full names for example.

- Mention expected granularity: If you are asking about a date, instead of asking "when", ask for a decade, year, month, date etc. If you are asking about a location, instead of asking "where", ask for a country, state, city, street, landmark etc.

- Avoid asking questions that their answers are acronyms.

Even for non-English examples keep the convention of using the English words "Paragraph", "Response", "question" and "answer" for specifying the parts being generated.

Generate only the question and answer. No need to continue with additional examples.

**Examples:**

Paragraph: The Great Barrier Reef is the world’s largest coral reef system, composed of over 2,900 individual reefs and 900 islands stretching for over 2,300 kilometers (1,400 mi) over an area of approximately 344,400 square kilometers (133,000 sq mi). The reef is located in the Coral Sea, off the coast of Queensland, Australia. The Great Barrier Reef can be seen from outer space and is the world’s biggest single structure made by living organisms.

Response: Question: Where is the Great Barrier Reef located? Answer: Coral Sea, off the coast of Queensland, Australia

Paragraph: Die Cazoo Snookerweltmeisterschaft 2023 wurde vom 15. April bis 1. Mai im Crucible Theatre in Sheffield ausgetragen. Mit ihr endete die Saison 2022/23 der World Snooker Tour.[1] Titelverteidiger Ronnie O’Sullivan scheiterte im Viertelfinale gegen Luca Brecel. Der Belgier erreichte das Finale und schlug dort den vierfachen Weltmeister Mark Selby mit 18:15. Brecel ist damit der erste Kontinentaleuropäer, der Weltmeister wurde. In diesem Jahr wurden noch weitere Bestmarken in Bezug auf die 47-jährige „Crucible-Ära“ aufgestellt. Unter anderem übertraf Ronnie O’Sullivan mit seiner 31. Endrundenteilnahme die 30 Teilnahmen von Steve Davis.[2] O’Sullivan erzielte auch sein 200. WM-Century-Break. Zweimal wurde ein Maximum Break erzielt, was es 2008 bereits einmal gegeben hatte; das „perfekte Break“ in einem WM-Finale gelang 2023 erstmals Mark Selby.

Response: Question: Gegen wen verlor Ronnie O’Sullivan im Viertelfinale der Snooker-Weltmeisterschaft 2023? Answer: Luca Brecel

Paragraph: {context}

Response:

Figure 5: Overall success results of Gemini 2.0 Pro broken down per source and target language. Blue is better, red is worse.

Translation

Translate the provided text from {in_lang} to {out_lang} while maintaining the original meaning and intent. Ensure accuracy and preserve the entities and concepts expressed in the source text.

Input in {in_lang}: {text} Response in {out_lang}:

A.2 Judging Prompt

**Task:** Determine if an answer to the question is supported by a given text.

**Input (in {target_language}):** - Text - Question - Answer

**Single Word Output (in English):** - YES: Answer is derived from the text. - NO: Answer is not derived from the text.

Text: {context}

Question: {question}

Answer: {predicted_answer}

Output:

A.3 Prompts for Ablation in Section 4.4

Closed-book

{question}

General hint

Answer the following question based on your knowledge in another language.

{question}

Source language name

Answer the following question based on your knowledge in {in_lang}.

{question}

Source language title

Answer the following question based on your knowledge in {in_lang} about {original_title}.

{question}

Cross-lingual open-book

Context: {original_context}

{question}

Appendix B Per-Language Breakdown - Overall Success

Figure 5 contains a per-language breakdown of the results in terms of overall success of the best performing model, Gemini 2.0 Pro, similar to the transfer score breakdown in Section 4.2. It shows that when taking into account the ability to answer questions in their source language, some languages lead to worse performance when used as a source. For example, the low scores on the diagonal for German, Hebrew, and Japanese lead to worse scores across the entire respective columns.